{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Finding Similar Movies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We'll start by loading up the MovieLens dataset. Using Pandas, we can very quickly load the rows of the u.data and u.item files that we care about, and merge them together so we can work with movie names instead of ID's. (In a real production job, you'd stick with ID's and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "r_cols = ['user_id', 'movie_id', 'rating']\n",
    "ratings = pd.read_csv('e:/sundog-consult/udemy/datascience/ml-100k/u.data', sep='\\t', names=r_cols, usecols=range(3), encoding=\"ISO-8859-1\")\n",
    "\n",
    "m_cols = ['movie_id', 'title']\n",
    "movies = pd.read_csv('e:/sundog-consult/udemy/datascience/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding=\"ISO-8859-1\")\n",
    "\n",
    "ratings = pd.merge(movies, ratings)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie_id</th>\n",
       "      <th>title</th>\n",
       "      <th>user_id</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>308</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>287</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>148</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>280</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>66</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movie_id             title  user_id  rating\n",
       "0         1  Toy Story (1995)      308       4\n",
       "1         1  Toy Story (1995)      287       5\n",
       "2         1  Toy Story (1995)      148       4\n",
       "3         1  Toy Story (1995)      280       4\n",
       "4         1  Toy Story (1995)       66       3"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix. Note how NaN indicates missing data - movies that specific users didn't rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>title</th>\n",
       "      <th>'Til There Was You (1997)</th>\n",
       "      <th>1-900 (1994)</th>\n",
       "      <th>101 Dalmatians (1996)</th>\n",
       "      <th>12 Angry Men (1957)</th>\n",
       "      <th>187 (1997)</th>\n",
       "      <th>2 Days in the Valley (1996)</th>\n",
       "      <th>20,000 Leagues Under the Sea (1954)</th>\n",
       "      <th>2001: A Space Odyssey (1968)</th>\n",
       "      <th>3 Ninjas: High Noon At Mega Mountain (1998)</th>\n",
       "      <th>39 Steps, The (1935)</th>\n",
       "      <th>...</th>\n",
       "      <th>Yankee Zulu (1994)</th>\n",
       "      <th>Year of the Horse (1997)</th>\n",
       "      <th>You So Crazy (1994)</th>\n",
       "      <th>Young Frankenstein (1974)</th>\n",
       "      <th>Young Guns (1988)</th>\n",
       "      <th>Young Guns II (1990)</th>\n",
       "      <th>Young Poisoner's Handbook, The (1995)</th>\n",
       "      <th>Zeus and Roxanne (1997)</th>\n",
       "      <th>unknown</th>\n",
       "      <th>Á köldum klaka (Cold Fever) (1994)</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>user_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 1664 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "title    'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  \\\n",
       "user_id                                                                   \n",
       "0                              NaN           NaN                    NaN   \n",
       "1                              NaN           NaN                    2.0   \n",
       "2                              NaN           NaN                    NaN   \n",
       "3                              NaN           NaN                    NaN   \n",
       "4                              NaN           NaN                    NaN   \n",
       "\n",
       "title    12 Angry Men (1957)  187 (1997)  2 Days in the Valley (1996)  \\\n",
       "user_id                                                                 \n",
       "0                        NaN         NaN                          NaN   \n",
       "1                        5.0         NaN                          NaN   \n",
       "2                        NaN         NaN                          NaN   \n",
       "3                        NaN         2.0                          NaN   \n",
       "4                        NaN         NaN                          NaN   \n",
       "\n",
       "title    20,000 Leagues Under the Sea (1954)  2001: A Space Odyssey (1968)  \\\n",
       "user_id                                                                      \n",
       "0                                        NaN                           NaN   \n",
       "1                                        3.0                           4.0   \n",
       "2                                        NaN                           NaN   \n",
       "3                                        NaN                           NaN   \n",
       "4                                        NaN                           NaN   \n",
       "\n",
       "title    3 Ninjas: High Noon At Mega Mountain (1998)  39 Steps, The (1935)  \\\n",
       "user_id                                                                      \n",
       "0                                                NaN                   NaN   \n",
       "1                                                NaN                   NaN   \n",
       "2                                                1.0                   NaN   \n",
       "3                                                NaN                   NaN   \n",
       "4                                                NaN                   NaN   \n",
       "\n",
       "title                   ...                  Yankee Zulu (1994)  \\\n",
       "user_id                 ...                                       \n",
       "0                       ...                                 NaN   \n",
       "1                       ...                                 NaN   \n",
       "2                       ...                                 NaN   \n",
       "3                       ...                                 NaN   \n",
       "4                       ...                                 NaN   \n",
       "\n",
       "title    Year of the Horse (1997)  You So Crazy (1994)  \\\n",
       "user_id                                                  \n",
       "0                             NaN                  NaN   \n",
       "1                             NaN                  NaN   \n",
       "2                             NaN                  NaN   \n",
       "3                             NaN                  NaN   \n",
       "4                             NaN                  NaN   \n",
       "\n",
       "title    Young Frankenstein (1974)  Young Guns (1988)  Young Guns II (1990)  \\\n",
       "user_id                                                                       \n",
       "0                              NaN                NaN                   NaN   \n",
       "1                              5.0                3.0                   NaN   \n",
       "2                              NaN                NaN                   NaN   \n",
       "3                              NaN                NaN                   NaN   \n",
       "4                              NaN                NaN                   NaN   \n",
       "\n",
       "title    Young Poisoner's Handbook, The (1995)  Zeus and Roxanne (1997)  \\\n",
       "user_id                                                                   \n",
       "0                                          NaN                      NaN   \n",
       "1                                          NaN                      NaN   \n",
       "2                                          NaN                      NaN   \n",
       "3                                          NaN                      NaN   \n",
       "4                                          NaN                      NaN   \n",
       "\n",
       "title    unknown  Á köldum klaka (Cold Fever) (1994)  \n",
       "user_id                                               \n",
       "0            NaN                                 NaN  \n",
       "1            4.0                                 NaN  \n",
       "2            NaN                                 NaN  \n",
       "3            NaN                                 NaN  \n",
       "4            NaN                                 NaN  \n",
       "\n",
       "[5 rows x 1664 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')\n",
    "movieRatings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's extract a Series of users who rated Star Wars:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "user_id\n",
       "0    5.0\n",
       "1    5.0\n",
       "2    5.0\n",
       "3    NaN\n",
       "4    5.0\n",
       "Name: Star Wars (1977), dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starWarsRatings = movieRatings['Star Wars (1977)']\n",
    "starWarsRatings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Pandas' corrwith function makes it really easy to compute the pairwise correlation of Star Wars' vector of user rating with every other movie! After that, we'll drop any results that have no data, and construct a new DataFrame of movies and their correlation score (similarity) to Star Wars:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\numpy\\lib\\function_base.py:2487: RuntimeWarning: Degrees of freedom <= 0 for slice\n",
      "  warnings.warn(\"Degrees of freedom <= 0 for slice\", RuntimeWarning)\n",
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\numpy\\lib\\function_base.py:2496: RuntimeWarning: divide by zero encountered in double_scalars\n",
      "  c *= 1. / np.float64(fact)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Til There Was You (1997)</th>\n",
       "      <td>0.872872</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1-900 (1994)</th>\n",
       "      <td>-0.645497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101 Dalmatians (1996)</th>\n",
       "      <td>0.211132</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12 Angry Men (1957)</th>\n",
       "      <td>0.184289</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>187 (1997)</th>\n",
       "      <td>0.027398</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2 Days in the Valley (1996)</th>\n",
       "      <td>0.066654</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20,000 Leagues Under the Sea (1954)</th>\n",
       "      <td>0.289768</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001: A Space Odyssey (1968)</th>\n",
       "      <td>0.230884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39 Steps, The (1935)</th>\n",
       "      <td>0.106453</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8 1/2 (1963)</th>\n",
       "      <td>-0.142977</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            0\n",
       "title                                        \n",
       "'Til There Was You (1997)            0.872872\n",
       "1-900 (1994)                        -0.645497\n",
       "101 Dalmatians (1996)                0.211132\n",
       "12 Angry Men (1957)                  0.184289\n",
       "187 (1997)                           0.027398\n",
       "2 Days in the Valley (1996)          0.066654\n",
       "20,000 Leagues Under the Sea (1954)  0.289768\n",
       "2001: A Space Odyssey (1968)         0.230884\n",
       "39 Steps, The (1935)                 0.106453\n",
       "8 1/2 (1963)                        -0.142977"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "similarMovies = movieRatings.corrwith(starWarsRatings)\n",
    "similarMovies = similarMovies.dropna()\n",
    "df = pd.DataFrame(similarMovies)\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "(That warning is safe to ignore.) Let's sort the results by similarity score, and we should have the movies most similar to Star Wars! Except... we don't. These results make no sense at all! This is why it's important to know your data - clearly we missed something important."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "title\n",
       "No Escape (1994)                                                                     1.000000\n",
       "Man of the Year (1995)                                                               1.000000\n",
       "Hollow Reed (1996)                                                                   1.000000\n",
       "Commandments (1997)                                                                  1.000000\n",
       "Cosi (1996)                                                                          1.000000\n",
       "Stripes (1981)                                                                       1.000000\n",
       "Golden Earrings (1947)                                                               1.000000\n",
       "Mondo (1996)                                                                         1.000000\n",
       "Line King: Al Hirschfeld, The (1996)                                                 1.000000\n",
       "Outlaw, The (1943)                                                                   1.000000\n",
       "Hurricane Streets (1998)                                                             1.000000\n",
       "Scarlet Letter, The (1926)                                                           1.000000\n",
       "Safe Passage (1994)                                                                  1.000000\n",
       "Good Man in Africa, A (1994)                                                         1.000000\n",
       "Full Speed (1996)                                                                    1.000000\n",
       "Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)    1.000000\n",
       "Star Wars (1977)                                                                     1.000000\n",
       "Ed's Next Move (1996)                                                                1.000000\n",
       "Twisted (1996)                                                                       1.000000\n",
       "Beans of Egypt, Maine, The (1994)                                                    1.000000\n",
       "Last Time I Saw Paris, The (1954)                                                    1.000000\n",
       "Maya Lin: A Strong Clear Vision (1994)                                               1.000000\n",
       "Designated Mourner, The (1997)                                                       0.970725\n",
       "Albino Alligator (1996)                                                              0.968496\n",
       "Angel Baby (1995)                                                                    0.962250\n",
       "Prisoner of the Mountains (Kavkazsky Plennik) (1996)                                 0.927173\n",
       "Love in the Afternoon (1957)                                                         0.923381\n",
       "'Til There Was You (1997)                                                            0.872872\n",
       "A Chef in Love (1996)                                                                0.868599\n",
       "Guantanamera (1994)                                                                  0.866025\n",
       "                                                                                       ...   \n",
       "Pushing Hands (1992)                                                                -1.000000\n",
       "Lamerica (1994)                                                                     -1.000000\n",
       "Year of the Horse (1997)                                                            -1.000000\n",
       "Collectionneuse, La (1967)                                                          -1.000000\n",
       "Dream Man (1995)                                                                    -1.000000\n",
       "S.F.W. (1994)                                                                       -1.000000\n",
       "Nightwatch (1997)                                                                   -1.000000\n",
       "Squeeze (1996)                                                                      -1.000000\n",
       "Glass Shield, The (1994)                                                            -1.000000\n",
       "Slingshot, The (1993)                                                               -1.000000\n",
       "Lover's Knot (1996)                                                                 -1.000000\n",
       "Tough and Deadly (1995)                                                             -1.000000\n",
       "Sliding Doors (1998)                                                                -1.000000\n",
       "Show, The (1995)                                                                    -1.000000\n",
       "Nil By Mouth (1997)                                                                 -1.000000\n",
       "Fall (1997)                                                                         -1.000000\n",
       "Sudden Manhattan (1996)                                                             -1.000000\n",
       "Salut cousin! (1996)                                                                -1.000000\n",
       "Neon Bible, The (1995)                                                              -1.000000\n",
       "Crossfire (1947)                                                                    -1.000000\n",
       "Love and Death on Long Island (1997)                                                -1.000000\n",
       "For Ever Mozart (1996)                                                              -1.000000\n",
       "Swept from the Sea (1997)                                                           -1.000000\n",
       "Fille seule, La (A Single Girl) (1995)                                              -1.000000\n",
       "American Dream (1990)                                                               -1.000000\n",
       "Theodore Rex (1995)                                                                 -1.000000\n",
       "I Like It Like That (1994)                                                          -1.000000\n",
       "Two Deaths (1995)                                                                   -1.000000\n",
       "Roseanna's Grave (For Roseanna) (1997)                                              -1.000000\n",
       "Frankie Starlight (1995)                                                            -1.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "similarMovies.sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Our results are probably getting messed up by movies that have only been viewed by a handful of people who also happened to like Star Wars. So we need to get rid of movies that were only watched by a few people that are producing spurious results. Let's construct a new DataFrame that counts up how many ratings exist for each movie, and also the average rating while we're at it - that could also come in handy later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">rating</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>size</th>\n",
       "      <th>mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>'Til There Was You (1997)</th>\n",
       "      <td>9</td>\n",
       "      <td>2.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1-900 (1994)</th>\n",
       "      <td>5</td>\n",
       "      <td>2.600000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>101 Dalmatians (1996)</th>\n",
       "      <td>109</td>\n",
       "      <td>2.908257</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12 Angry Men (1957)</th>\n",
       "      <td>125</td>\n",
       "      <td>4.344000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>187 (1997)</th>\n",
       "      <td>41</td>\n",
       "      <td>3.024390</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          rating          \n",
       "                            size      mean\n",
       "title                                     \n",
       "'Til There Was You (1997)      9  2.333333\n",
       "1-900 (1994)                   5  2.600000\n",
       "101 Dalmatians (1996)        109  2.908257\n",
       "12 Angry Men (1957)          125  4.344000\n",
       "187 (1997)                    41  3.024390"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})\n",
    "movieStats.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">rating</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>size</th>\n",
       "      <th>mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Close Shave, A (1995)</th>\n",
       "      <td>112</td>\n",
       "      <td>4.491071</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Schindler's List (1993)</th>\n",
       "      <td>298</td>\n",
       "      <td>4.466443</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wrong Trousers, The (1993)</th>\n",
       "      <td>118</td>\n",
       "      <td>4.466102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Casablanca (1942)</th>\n",
       "      <td>243</td>\n",
       "      <td>4.456790</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Shawshank Redemption, The (1994)</th>\n",
       "      <td>283</td>\n",
       "      <td>4.445230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rear Window (1954)</th>\n",
       "      <td>209</td>\n",
       "      <td>4.387560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Usual Suspects, The (1995)</th>\n",
       "      <td>267</td>\n",
       "      <td>4.385768</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Star Wars (1977)</th>\n",
       "      <td>584</td>\n",
       "      <td>4.359589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12 Angry Men (1957)</th>\n",
       "      <td>125</td>\n",
       "      <td>4.344000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Citizen Kane (1941)</th>\n",
       "      <td>198</td>\n",
       "      <td>4.292929</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>To Kill a Mockingbird (1962)</th>\n",
       "      <td>219</td>\n",
       "      <td>4.292237</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>One Flew Over the Cuckoo's Nest (1975)</th>\n",
       "      <td>264</td>\n",
       "      <td>4.291667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Silence of the Lambs, The (1991)</th>\n",
       "      <td>390</td>\n",
       "      <td>4.289744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>North by Northwest (1959)</th>\n",
       "      <td>179</td>\n",
       "      <td>4.284916</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Godfather, The (1972)</th>\n",
       "      <td>413</td>\n",
       "      <td>4.283293</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                       rating          \n",
       "                                         size      mean\n",
       "title                                                  \n",
       "Close Shave, A (1995)                     112  4.491071\n",
       "Schindler's List (1993)                   298  4.466443\n",
       "Wrong Trousers, The (1993)                118  4.466102\n",
       "Casablanca (1942)                         243  4.456790\n",
       "Shawshank Redemption, The (1994)          283  4.445230\n",
       "Rear Window (1954)                        209  4.387560\n",
       "Usual Suspects, The (1995)                267  4.385768\n",
       "Star Wars (1977)                          584  4.359589\n",
       "12 Angry Men (1957)                       125  4.344000\n",
       "Citizen Kane (1941)                       198  4.292929\n",
       "To Kill a Mockingbird (1962)              219  4.292237\n",
       "One Flew Over the Cuckoo's Nest (1975)    264  4.291667\n",
       "Silence of the Lambs, The (1991)          390  4.289744\n",
       "North by Northwest (1959)                 179  4.284916\n",
       "Godfather, The (1972)                     413  4.283293"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "popularMovies = movieStats['rating']['size'] >= 100\n",
    "movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "100 might still be too low, but these results look pretty good as far as \"well rated movies that people have heard of.\" Let's join this data with our original set of similar movies to Star Wars:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Frank\\AppData\\Local\\Enthought\\Canopy\\edm\\envs\\User\\lib\\site-packages\\pandas\\tools\\merge.py:536: UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)\n",
      "  warnings.warn(msg, UserWarning)\n"
     ]
    }
   ],
   "source": [
    "df = movieStats[popularMovies].join(pd.DataFrame(similarMovies, columns=['similarity']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>(rating, size)</th>\n",
       "      <th>(rating, mean)</th>\n",
       "      <th>similarity</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>101 Dalmatians (1996)</th>\n",
       "      <td>109</td>\n",
       "      <td>2.908257</td>\n",
       "      <td>0.211132</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12 Angry Men (1957)</th>\n",
       "      <td>125</td>\n",
       "      <td>4.344000</td>\n",
       "      <td>0.184289</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001: A Space Odyssey (1968)</th>\n",
       "      <td>259</td>\n",
       "      <td>3.969112</td>\n",
       "      <td>0.230884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Absolute Power (1997)</th>\n",
       "      <td>127</td>\n",
       "      <td>3.370079</td>\n",
       "      <td>0.085440</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Abyss, The (1989)</th>\n",
       "      <td>151</td>\n",
       "      <td>3.589404</td>\n",
       "      <td>0.203709</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              (rating, size)  (rating, mean)  similarity\n",
       "title                                                                   \n",
       "101 Dalmatians (1996)                    109        2.908257    0.211132\n",
       "12 Angry Men (1957)                      125        4.344000    0.184289\n",
       "2001: A Space Odyssey (1968)             259        3.969112    0.230884\n",
       "Absolute Power (1997)                    127        3.370079    0.085440\n",
       "Abyss, The (1989)                        151        3.589404    0.203709"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "And, sort these new results by similarity score. That's more like it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>(rating, size)</th>\n",
       "      <th>(rating, mean)</th>\n",
       "      <th>similarity</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Star Wars (1977)</th>\n",
       "      <td>584</td>\n",
       "      <td>4.359589</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Empire Strikes Back, The (1980)</th>\n",
       "      <td>368</td>\n",
       "      <td>4.206522</td>\n",
       "      <td>0.748353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Return of the Jedi (1983)</th>\n",
       "      <td>507</td>\n",
       "      <td>4.007890</td>\n",
       "      <td>0.672556</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Raiders of the Lost Ark (1981)</th>\n",
       "      <td>420</td>\n",
       "      <td>4.252381</td>\n",
       "      <td>0.536117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Austin Powers: International Man of Mystery (1997)</th>\n",
       "      <td>130</td>\n",
       "      <td>3.246154</td>\n",
       "      <td>0.377433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sting, The (1973)</th>\n",
       "      <td>241</td>\n",
       "      <td>4.058091</td>\n",
       "      <td>0.367538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Indiana Jones and the Last Crusade (1989)</th>\n",
       "      <td>331</td>\n",
       "      <td>3.930514</td>\n",
       "      <td>0.350107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pinocchio (1940)</th>\n",
       "      <td>101</td>\n",
       "      <td>3.673267</td>\n",
       "      <td>0.347868</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Frighteners, The (1996)</th>\n",
       "      <td>115</td>\n",
       "      <td>3.234783</td>\n",
       "      <td>0.332729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>L.A. Confidential (1997)</th>\n",
       "      <td>297</td>\n",
       "      <td>4.161616</td>\n",
       "      <td>0.319065</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Wag the Dog (1997)</th>\n",
       "      <td>137</td>\n",
       "      <td>3.510949</td>\n",
       "      <td>0.318645</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dumbo (1941)</th>\n",
       "      <td>123</td>\n",
       "      <td>3.495935</td>\n",
       "      <td>0.317656</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bridge on the River Kwai, The (1957)</th>\n",
       "      <td>165</td>\n",
       "      <td>4.175758</td>\n",
       "      <td>0.316580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Philadelphia Story, The (1940)</th>\n",
       "      <td>104</td>\n",
       "      <td>4.115385</td>\n",
       "      <td>0.314272</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Miracle on 34th Street (1994)</th>\n",
       "      <td>101</td>\n",
       "      <td>3.722772</td>\n",
       "      <td>0.310921</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    (rating, size)  \\\n",
       "title                                                                \n",
       "Star Wars (1977)                                               584   \n",
       "Empire Strikes Back, The (1980)                                368   \n",
       "Return of the Jedi (1983)                                      507   \n",
       "Raiders of the Lost Ark (1981)                                 420   \n",
       "Austin Powers: International Man of Mystery (1997)             130   \n",
       "Sting, The (1973)                                              241   \n",
       "Indiana Jones and the Last Crusade (1989)                      331   \n",
       "Pinocchio (1940)                                               101   \n",
       "Frighteners, The (1996)                                        115   \n",
       "L.A. Confidential (1997)                                       297   \n",
       "Wag the Dog (1997)                                             137   \n",
       "Dumbo (1941)                                                   123   \n",
       "Bridge on the River Kwai, The (1957)                           165   \n",
       "Philadelphia Story, The (1940)                                 104   \n",
       "Miracle on 34th Street (1994)                                  101   \n",
       "\n",
       "                                                    (rating, mean)  similarity  \n",
       "title                                                                           \n",
       "Star Wars (1977)                                          4.359589    1.000000  \n",
       "Empire Strikes Back, The (1980)                           4.206522    0.748353  \n",
       "Return of the Jedi (1983)                                 4.007890    0.672556  \n",
       "Raiders of the Lost Ark (1981)                            4.252381    0.536117  \n",
       "Austin Powers: International Man of Mystery (1997)        3.246154    0.377433  \n",
       "Sting, The (1973)                                         4.058091    0.367538  \n",
       "Indiana Jones and the Last Crusade (1989)                 3.930514    0.350107  \n",
       "Pinocchio (1940)                                          3.673267    0.347868  \n",
       "Frighteners, The (1996)                                   3.234783    0.332729  \n",
       "L.A. Confidential (1997)                                  4.161616    0.319065  \n",
       "Wag the Dog (1997)                                        3.510949    0.318645  \n",
       "Dumbo (1941)                                              3.495935    0.317656  \n",
       "Bridge on the River Kwai, The (1957)                      4.175758    0.316580  \n",
       "Philadelphia Story, The (1940)                            4.115385    0.314272  \n",
       "Miracle on 34th Street (1994)                             3.722772    0.310921  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values(['similarity'], ascending=False)[:15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Ideally we'd also filter out the movie we started from - of course Star Wars is 100% similar to itself. But otherwise these results aren't bad."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "100 was an arbitrarily chosen cutoff. Try different values - what effect does it have on the end results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
